# 04. Fixed Q-Targets
## Summary
In Q-Learning, we update a guess with a guess, and this can potentially lead to harmful correlations. To avoid this, we can update the parameters $w$ in the network $\hat{q}$ to better approximate the action value corresponding to state $S$ and action $A$ with the following update rule:

$$\Delta w = \alpha \left( R + \gamma \max_a \hat{q}(S', a, w^-) - \hat{q}(S, A, w) \right) \nabla_w \hat{q}(S, A, w)$$

where $w^-$ are the weights of a separate target network that are not changed during the learning step, and $(S, A, R, S')$ is an experience tuple.
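As a rough illustration of this update rule, here is a minimal PyTorch-style sketch of a single learning step with fixed Q-targets. The `QNetwork` class, the hyperparameter values, and the variable names are assumptions made for illustration; they are not part of the original lesson.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical network architecture; any works, as long as both copies are identical.
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size))

    def forward(self, state):
        return self.net(state)

GAMMA = 0.99                      # discount factor (assumed value)
state_size, action_size = 8, 4    # assumed environment dimensions

q_local = QNetwork(state_size, action_size)    # weights w, updated every learning step
q_target = QNetwork(state_size, action_size)   # weights w^-, held fixed during the step
q_target.load_state_dict(q_local.state_dict())
optimizer = optim.Adam(q_local.parameters(), lr=5e-4)

def learn(states, actions, rewards, next_states, dones):
    """One gradient step on w, using the frozen target weights w^- to form the TD target."""
    # TD target: R + gamma * max_a q_hat(S', a, w^-), computed without tracking gradients
    with torch.no_grad():
        q_next = q_target(next_states).max(dim=1, keepdim=True)[0]
        targets = rewards + GAMMA * q_next * (1 - dones)

    # Current estimate q_hat(S, A, w) for the actions actually taken
    q_expected = q_local(states).gather(1, actions)

    # Minimizing this loss corresponds to the update rule above
    loss = nn.functional.mse_loss(q_expected, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because `q_target` is not touched inside `learn`, the target in the loss stays put while `q_local` is adjusted toward it, which is exactly what breaks the "guess chasing a guess" correlation.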
Note: Ever wondered how the example in the video would look in real life? See: Carrot Stick Riding.
## Quiz
SOLUTION:
- The Deep Q-Learning algorithm uses two separate networks with identical architectures.
- The target Q-Network's weights are updated less often (or more slowly) than the primary Q-Network (see the sketch after this list).
- Without fixed Q-targets, we would encounter a harmful form of correlation, whereby we shift the parameters of the network based on a constantly moving target.
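To make the second point concrete, here is a hedged sketch of the two common ways the target weights $w^-$ are refreshed: a hard copy every fixed number of learning steps, or a slow "soft" update controlled by an interpolation factor. The function names and constants are illustrative assumptions, reusing `q_local` and `q_target` from the sketch above.

```python
TAU = 1e-3            # soft-update interpolation factor (assumed value)
UPDATE_EVERY = 1000   # hard-update period in learning steps (assumed value)

def hard_update(step):
    """Copy w into w^- every UPDATE_EVERY steps; between copies, w^- stays fixed."""
    if step % UPDATE_EVERY == 0:
        q_target.load_state_dict(q_local.state_dict())

def soft_update():
    """Move w^- a small fraction toward w: w^- <- tau*w + (1 - tau)*w^-."""
    for target_param, local_param in zip(q_target.parameters(), q_local.parameters()):
        target_param.data.copy_(TAU * local_param.data + (1.0 - TAU) * target_param.data)
```

Either variant keeps the target network lagging behind the primary network, so the TD target changes slowly rather than shifting with every gradient step.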